Abstract

Following a review of metric, ultrametric and generalized ultrametric, 
we review their application in data analysis. We show how they allow 
us to explore both geometry and topology of information, starting with 
measured data. Some themes are then developed based on the use of 
metric, ultrametric and generalized ultrametric in logic. In particular we 
study approximation chains in an ultrametric or generalized ultrametric 
context. Our aim in this work is to extend the scope of data analysis 
by facilitating reasoning based on the data analysis; and to show how 
quantitative and qualitative data analysis can be incorporated into logic 
programming. 



1 Introduction 

The applicability of metric spaces to applications related to logic has long been 
known. For example Lawvere [HO [2D] starts with the observation of the analogy 
of the triangular inequality and a categorical composition law. A comprehensive 
survey of this area can be found in [5o] . 

Hierarchies as used in data analysis are presented in terms of finding various
forms of symmetry in data in [29]. We could describe a hierarchy built from
pairwise dissimilarities as a "precision tool" for data mining; and hierarchies
built from the generalized ultrametric (see section 4) as leading to a "power
tool" for data mining. The former is (without special algorithmic speedups)
typically quadratic, or $O(n^2)$, in its computational requirement. The latter can
be linear, or $O(n)$, in its computation. Here $n$ is the number of observations.



We begin in section 2 with data analysis. We motivate the hierarchical
structuring of data, describing at a general level how the geometry and the
topology of information come into play, related respectively to metric and
ultrametric embedding of data.

In section 3 we show how hierarchy, induced from data, can be made use
of for approximating data. The latter, approximating data, is applicable and
important for computational purposes.

In logic, chains of implications or conditionals have to be analyzed. When
we consider a partial order of conditionals, then the framework of spherical
(ultrametric) completeness or inductive limit (section 4.1 and especially
section 3.1) becomes very useful indeed.

In section 4.1 we will look at how, in [5], a "computable real number is ...
the lub [least upper bound] of a shrinking sequence of rational intervals which is
generated by a master program", and therefore how a real number is computable
"in the interval approach to computability on the real line".

The convergence to fixed points that are based on a generalized ultrametric
system is precisely the study of spherically complete systems and expansive
automorphisms discussed in section 3.1. As expansive automorphisms we see
here again an example of data and information symmetry at work.



2 From Metric to Ultrametric Topology 

We will discuss how an ultrametric topology - a tree structuring of the data - 
is induced from data, using pairwise dissimilarities. 

2.1 Pairwise Dissimilarities 

Given an observation set, $X$, we define dissimilarities as the mapping
$d : X \times X \to \mathbb{R}^+$, where $\mathbb{R}^+$ are the positive reals.
A dissimilarity is a positive, definite, symmetric measure (i.e.,
$d(x, y) \geq 0$; $d(x, y) = 0$ if $x = y$; $d(x, y) = d(y, x)$). If in addition
the triangular inequality is satisfied (i.e., $d(x, y) \leq d(x, z) + d(z, y)$,
$\forall x, y, z \in X$) then the dissimilarity is a distance.
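
The definitions above can be checked mechanically on a finite set of points. The following is a minimal sketch, not part of the original text: the function names and the three test points are illustrative assumptions. It shows that the squared Euclidean distance is a dissimilarity but not a distance, since it violates the triangular inequality on collinear points.

```python
from itertools import permutations

def is_dissimilarity(d, points, tol=1e-12):
    """Check d(x, y) >= 0, d(x, x) = 0 and d(x, y) = d(y, x) on a finite set."""
    for x in points:
        if abs(d(x, x)) > tol:
            return False
        for y in points:
            if d(x, y) < -tol or abs(d(x, y) - d(y, x)) > tol:
                return False
    return True

def is_distance(d, points, tol=1e-12):
    """A dissimilarity that also satisfies the triangular inequality."""
    if not is_dissimilarity(d, points, tol):
        return False
    return all(d(x, y) <= d(x, z) + d(z, y) + tol
               for x, y, z in permutations(points, 3))

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def squared_euclidean(x, y):
    return euclidean(x, y) ** 2

# Three collinear points (hypothetical data, chosen for illustration).
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
print(is_distance(euclidean, points))            # True
print(is_dissimilarity(squared_euclidean, points))  # True
print(is_distance(squared_euclidean, points))    # False: 4 > 1 + 1
```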

2.1.1 From Dissimilarities to an Ultrametric 

If X is endowed with a metric, then we now describe how this metric is mapped 
onto an ultrametric. In practice, there is no need for X to be endowed with a 
metric. Instead a dissimilarity is satisfactory. 

A hierarchy, H, is defined as a binary, rooted, node-ranked tree, also termed 
a dendrogram [3l [16j EH EI] • A hierarchy defines a set of embedded subsets of a 
given set of objects $X$, indexed by the set $I$. These subsets are totally ordered
by an index function $\nu$, which is a stronger condition than the partial order
required by the subset relation. A bijection exists between a hierarchy and an
ultrametric space.

Let us show these equivalences between embedded subsets, hierarchy, and
binary tree, through the constructive approach of inducing $H$ on a set $I$.

Figure 1: The triangular inequality defines a metric: every triplet of points
satisfies the relationship: $d(x, z) \leq d(x, y) + d(y, z)$ for distance $d$.



Hierarchical agglomeration on $n$ observation vectors with indices $i \in I$
involves a series of $1, 2, \ldots, n - 1$ pairwise agglomerations of observations
or clusters, with the following properties. A hierarchy
$H = \{q \mid q \in 2^I\}$ such that (i) $I \in H$, (ii) $i \in H \ \forall i$, and
(iii) for each $q \in H, q' \in H$:
$q \cap q' \neq \emptyset \implies q \subset q'$ or $q' \subset q$.
Here we have denoted the power set of set $I$ by $2^I$. An indexed hierarchy is
the pair $(H, \nu)$ where the positive function defined on $H$, i.e.,
$\nu : H \to \mathbb{R}^+$, satisfies: $\nu(i) = 0$ if $i \in H$ is a singleton;
and $q \subset q' \implies \nu(q) < \nu(q')$. Here we have denoted the positive
reals, including 0, by $\mathbb{R}^+$. Function $\nu$ is the agglomeration level.
Take $q \subset q'$, let $q \subset q''$ and $q' \subset q''$, and let $q''$ be the
lowest level cluster for which this is true. Then if we define
$D(q, q') = \nu(q'')$, $D$ is an ultrametric. In practice, we start with a
Euclidean or alternative dissimilarity, use some criterion such as minimizing the
change in variance resulting from the agglomerations, and then define $\nu(q)$ as
the dissimilarity associated with the agglomeration carried out.
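
The construction of an indexed hierarchy, and of the ultrametric $D$ read off from it, can be sketched as follows. This is an illustrative sketch only: it uses the single link criterion rather than the variance criterion mentioned above, simply because single link needs no cluster centres, and the function names and the five points on a line are assumptions made for the example.

```python
from itertools import combinations

def build_hierarchy(dist):
    """Naive pairwise agglomeration (single link, an assumed criterion).

    dist: symmetric matrix as a list of lists.
    Returns a list of (cluster, level) pairs: the indexed hierarchy (H, nu).
    """
    def link(a, b):
        return min(dist[i][j] for i in a for j in b)

    clusters = [frozenset([i]) for i in range(len(dist))]
    hierarchy = [(q, 0.0) for q in clusters]          # nu(singleton) = 0
    while len(clusters) > 1:
        q1, q2 = min(combinations(clusters, 2), key=lambda p: link(*p))
        level = link(q1, q2)
        merged = q1 | q2
        clusters = [q for q in clusters if q not in (q1, q2)] + [merged]
        hierarchy.append((merged, level))
    return hierarchy

def D(hierarchy, i, j):
    """Level of the lowest cluster containing both i and j."""
    return min(level for q, level in hierarchy if i in q and j in q)

points = [0.0, 1.0, 3.0, 7.0, 8.0]                    # hypothetical data
dist = [[abs(a - b) for b in points] for a in points]
H = build_hierarchy(dist)

n = len(points)
ultrametric = all(D(H, i, k) <= max(D(H, i, j), D(H, j, k))
                  for i in range(n) for j in range(n) for k in range(n))
print(ultrametric)  # True
```

Each of the $n - 1$ agglomerations adds one cluster to the hierarchy, so `H` holds $2n - 1$ clusters including the singletons.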

2.2 Metric and Ultrametric for Geometry and Topology 
of Information 

The geometry of information is a term and viewpoint used by [37]. The
triangular inequality holds for metrics. An example of a metric is the Euclidean
distance, exemplified in Figure 1, where each and every triplet of points
satisfies the relationship: $d(x, z) \leq d(x, y) + d(y, z)$ for distance $d$.
Two other relationships also must hold. These are symmetry and positive
definiteness, respectively: $d(x, y) = d(y, x)$, and $d(x, y) > 0$ if $x \neq y$,
$d(x, y) = 0$ if $x = y$.

Figure 2: The query is on the far right. While we can easily determine the 
closest target (among the three objects represented by the dots on the left), is 
the closest really that much different from the alternatives? 



We come now to a different principle: that of the topology of information. 
The particular topology used is that of hierarchy. Euclidean embedding provides 
a very good starting point to look at hierarchical relationships. An innovation in 
our work is as follows: the hierarchy takes sequence, e.g. timeline, into account. 
This captures, in a more easily understood way, the notions of novelty, anomaly 
or change. 

Let us take an informal case study to see how this works. Consider the 
situation of seeking documents based on titles. If the target population has 
at least one document that is close to the query, then this is (let us assume) 
clearcut. However if all documents in the target population are very unlike the 
query, does it make any sense to choose the closest? Whatever the answer here 
we are focusing on the inherent ambiguity, which we will note or record in an 
appropriate way. Figure 2 illustrates this situation, where the query is the point
to the right.

By using approximate similarity this situation can be modeled as an isosceles
triangle with small base, as illustrated in Figure 2. An ultrametric space has
properties that are very unlike a metric space, and one such property is that the
only triangles allowed are either (i) equilateral, or (ii) isosceles with small base.
So Figure 2 can be taken as representing a case of ultrametricity. What this
means is that the query can be viewed as having a particular sort of dominance
or hierarchical relationship vis-a-vis any pair of target documents. Hence any
triplet of points here, one of which is the query (defining the apex of the isosceles,
with small base, triangle), defines local hierarchical or ultrametric structure.
(See [26] for case studies.)

Figure 3: The strong triangular inequality defines an ultrametric: every triplet
of points satisfies the relationship: $d(x, z) \leq \max\{d(x, y), d(y, z)\}$ for
distance $d$. Cf., by reading off the hierarchy, how this is verified for all
$x, y, z$: $d(x, z) = 3.5$; $d(x, y) = 3.5$; $d(y, z) = 1.0$. In addition the
symmetry and positive definiteness conditions hold for any pair of points.

It is clear from Figure 2 that we should use approximate equality of the long
sides of the triangle. The further away the query is from the other data then
the better is this approximation [26].

What sort of explanation does this provide for our conundrum? It means 
that the query is a novel, or anomalous, or unusual "document" . It is up to us to 
decide how to treat such new, innovative cases. It raises though the interesting 
perspective that here we have a way to model and subsequently handle the 
semantics of anomaly or innocuousness. 

The strong triangular inequality, or ultrametric inequality, holds for tree
distances: see Figure 3. The closest common ancestor distance is such an
ultrametric.
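
The triangle property can be tested directly on the three side lengths. The sketch below is illustrative (the function name is an assumption): a triplet satisfies the strong triangular inequality for every ordering of its points exactly when its two largest sides are equal, which yields the two allowed shapes. The first call uses the values quoted in the caption of Figure 3.

```python
def triangle_type(dxy, dyz, dxz, tol=1e-9):
    """Classify a triangle by whether it satisfies the ultrametric inequality."""
    a, b, c = sorted((dxy, dyz, dxz))
    # Every side must be <= the max of the other two, so the two largest are equal.
    if abs(c - b) > tol:
        return "not ultrametric"
    return "equilateral" if abs(b - a) <= tol else "isosceles with small base"

print(triangle_type(3.5, 1.0, 3.5))  # isosceles with small base
print(triangle_type(2.0, 2.0, 2.0))  # equilateral
print(triangle_type(3.0, 4.0, 5.0))  # not ultrametric
```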



2.3 Hierarchical Agglomerative Clustering 

Since pairwise dissimilarities are used in constructing the hierarchy, the
computational complexity of hierarchical clustering is at least $O(n^2)$. As the
closest clusters (including singletons) are agglomerated at each of $n - 1$
agglomerations (card $X$ = card $I$ = $n$), the newly created cluster must be
related to others. This is part and parcel of the agglomeration criterion, and can
be viewed either as the cluster update rule, or the agglomerative criterion (e.g.,
based on compactness, or connectivity).

The most efficient algorithms are based on nearest neighbor chains, which
by definition end in a pair of agglomerable reciprocal nearest neighbors. $O(n^2)$
computation time is guaranteed. The uniqueness and acceptability of on-the-fly
agglomeration based on reciprocal nearest neighbors can be proven (respectively,
disproven) for the given agglomerative criterion. The reciprocal nearest
neighbor algorithm was first proposed in two articles in the journal Les Cahiers de
l'Analyse des Données in 1980 and 1982, and is now used in software packages
such as Clustan and R. Further information can be found in [33J H31 HH US] . 
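
A nearest neighbor chain can be sketched in a few lines. The version below is an illustration under stated assumptions, not the published algorithm verbatim: it uses the complete link criterion (one criterion for which on-the-fly agglomeration of reciprocal nearest neighbors is acceptable), a dictionary of dissimilarities rather than an optimized array layout, and hypothetical names throughout.

```python
def nn_chain(dist):
    """Complete link agglomeration driven by a nearest neighbor chain.

    dist: symmetric matrix as a list of lists.
    Returns (a, b, level) merges; ids >= n name clusters created by merges.
    """
    n = len(dist)
    d = {i: {j: dist[i][j] for j in range(n) if j != i} for i in range(n)}
    merges, chain, next_id = [], [], n
    while len(d) > 1:
        if not chain:
            chain.append(min(d))               # arbitrary starting cluster
        while True:
            a = chain[-1]
            prev = chain[-2] if len(chain) > 1 else None
            # Nearest neighbor of a; on ties prefer the previous chain element.
            b = min(d[a], key=lambda c: (d[a][c], c != prev))
            if b == prev:                      # reciprocal nearest neighbors
                break
            chain.append(b)
        a, b = chain.pop(), chain.pop()
        level = d[a][b]
        new = {}
        for c in d:
            if c not in (a, b):
                # Complete link update: dissimilarity to the merged cluster.
                new[c] = max(d[c].pop(a), d[c].pop(b))
                d[c][next_id] = new[c]
        del d[a], d[b]
        d[next_id] = new
        merges.append((a, b, level))
        next_id += 1
    return merges

points = [0.0, 1.0, 3.0, 7.0]                  # hypothetical data
dist = [[abs(p - q) for q in points] for p in points]
merges = nn_chain(dist)
print([level for _, _, level in merges])       # [1.0, 3.0, 7.0]
```

In general the merges found this way are not produced in increasing order of level, so a final sort by level is needed before reading the result as a ranked dendrogram.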

2.4 Hierarchy as the Wreath Product Group expressing 
Symmetries 

A dendrogram like that shown in Figure [I] is invariant relative to rotation (al- 
ternatively, here: permutation) of left and right child nodes. These rotation (or 
permutation) symmetries are defined by the wreath product group (see [9j [7] 
for an introduction and applications in signal and image processing), and can 
be used with any m-ary tree, although we will treat the binary case here. 

For the group actions, with respect to which we will seek invariance, we 
consider independent cyclic shifts of the subnodes of a given node (hence, at 
each level). Equivalently these actions are adjacency preserving permutations 
of subnodes of a given node (i.e., for given q, with q = q' U q", the permutations 
of {q',q"}). We have therefore cyclic group actions at each node, where the 
cyclic group is of order 2. 

The symmetries of H are given by structured permutations of the terminals. 
The terminals will be denoted here by Term H. The full group of symmetries 
is summarized by the following generative algorithm: 

1. For level $l = n - 1$ down to 1 do:

2. Selected node, $\nu \leftarrow$ node at level $l$.

3. And permute subnodes of $\nu$.

Subnode $\nu$ is the root of subtree $H_\nu$. We denote $H_{n-1}$ simply by $H$. For
a subnode $\nu'$ undergoing a relocation action in step 3, the internal structure of
subtree $H_{\nu'}$ is not altered.

The algorithm described defines the automorphism group which is a wreath
product of the symmetric group. Denote the permutation at level $\nu$ by $P_\nu$.
Then the automorphism group is given by:

$G = P_{n-1} \ \mathrm{wr} \ P_{n-2} \ \mathrm{wr} \ \ldots \ \mathrm{wr} \ P_2 \ \mathrm{wr} \ P_1$

where wr denotes the wreath product.
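
The generative algorithm can be illustrated by enumerating the terminal orderings it produces. In this sketch (names and representation are assumptions) the hierarchy of Figure 4 is written as nested pairs, following the clusters listed in its caption, and each internal node either keeps or swaps its two subnodes. A binary tree on $n$ terminals has $n - 1$ internal nodes, so $2^{n-1}$ orderings result.

```python
from itertools import product

# Figure 4's hierarchy as nested (left, right) pairs; terminals are integers.
TREE = ((((1, 2), 3), ((4, 5), 6)), (7, 8))

def count_internal(t):
    """Number of non-terminal nodes in the tree."""
    if not isinstance(t, tuple):
        return 0
    return 1 + count_internal(t[0]) + count_internal(t[1])

def leaf_orders(t):
    """Yield every terminal ordering reachable by permuting subnodes."""
    if not isinstance(t, tuple):
        yield (t,)
        return
    lefts, rights = list(leaf_orders(t[0])), list(leaf_orders(t[1]))
    for left, right in product(lefts, rights):
        yield left + right      # identity at this node
        yield right + left      # swap the two subnodes

orders = set(leaf_orders(TREE))
print(len(orders), 2 ** count_internal(TREE))  # 128 128
```

Orderings that separate members of a cluster, such as one starting 1, 3, 2, are never generated: the internal structure of each subtree is preserved.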

Call Term $H_\nu$ the terminals that descend from the node at level $\nu$. So these
are the terminals of the subtree $H_\nu$ with its root node at level $\nu$. We can
alternatively call Term $H_\nu$ the cluster associated with level $\nu$.

We will now look at shift invariance under the group action. This amounts to
the requirement for a constant function defined on Term $H_\nu$, $\forall \nu$. A
convenient way to do this is to define such a function on the set Term $H_\nu$ via
the root node alone, $\nu$. By definition then we have a constant function on the
set Term $H_\nu$.

Let us call $V_\nu$ a space of functions that are constant on Term $H_\nu$. Possible
bases of $V_\nu$ that were considered in [27] are:

1. Basis vector with $|\mathrm{Term}\ H_{n-1}|$ components, with 0 values except
for value 1 for component $i$.

2. Set (of cardinality $n = |\mathrm{Term}\ H_{n-1}|$) of $m$-dimensional
observation vectors.

The constant function for each node or level $\nu$ is:

$f_\nu : \mathrm{Term}\ H_\nu \to V_\nu$

Consider the resolution scheme arising from moving from
$\{\mathrm{Term}\ H_{\nu'}, \mathrm{Term}\ H_{\nu''}\}$ to Term $H_\nu$. From the
hierarchical clustering point of view it is clear what this represents: simply, an
agglomeration of two clusters called Term $H_{\nu'}$ and Term $H_{\nu''}$, replacing
them with a new cluster, Term $H_\nu$.

Let the spaces of constant functions corresponding to the two cluster
agglomerands be denoted $V_{\nu'}$ and $V_{\nu''}$. These two clusters are disjoint
initially, which motivates us taking the two spaces as a couple:
$(V_{\nu'}, V_{\nu''})$. In the same way, let the space of constant functions
corresponding to node $\nu$ be denoted $V_\nu$.

Let us exemplify a case that satisfies all that has been defined in the context
of the wreath product invariance that we are targeting. It is the algorithm
discussed in depth in [27]. Take the constant function on $V_{\nu'}$ to be
$f_{\nu'}$. Take the constant function on $V_{\nu''}$ to be $f_{\nu''}$. Then define
the constant function, the scaling function, on $V_\nu$ to be
$(f_{\nu'} + f_{\nu''})/2$. Next define the zero mean function,
$(w_{\nu'} + w_{\nu''})/2 = 0$, the wavelet function, as follows:

$w_{\nu'} = (f_{\nu'} + f_{\nu''})/2 - f_{\nu'}$

in the support interval of $V_{\nu'}$, i.e. Term $H_{\nu'}$, and

$w_{\nu''} = (f_{\nu'} + f_{\nu''})/2 - f_{\nu''}$

in the support interval of $V_{\nu''}$, i.e. Term $H_{\nu''}$.

Since $w_{\nu'} = -w_{\nu''}$ we have the zero mean requirement.

3 Approximation in an Ultrametric Topology 

We now seek to use a hierarchical clustering for successively approximating an 
object. In [28] we have examples of application to facial recognition and textual 
analysis. 

Following a general view of hierarchical approximation in subsection 3.1,
we then proceed to an algorithm, and a data analysis framework, to support
hierarchical approximation.



3.1 Approximation from a Hierarchy: Dilation Operation
as p-Adic Multiplication by 1/p

Scale-related symmetry is very important in practice. In this subsection we
introduce an operator that provides this symmetry. We also term it a dilation
operator, because of its role in the wavelet transform on trees (see [27] for
discussion and examples).

First we introduce a p-adic encoding of a hierarchy, using Figure 4 as an
example. By means of terminal-to-root traversals, we define the following
p-adic encoding of terminal nodes, and hence objects, in Figure 4:
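
$x_1$: $+1 \cdot p^1 + 1 \cdot p^2 + 1 \cdot p^5 + 1 \cdot p^7$
$x_2$: $-1 \cdot p^1 + 1 \cdot p^2 + 1 \cdot p^5 + 1 \cdot p^7$
$x_3$: $-1 \cdot p^2 + 1 \cdot p^5 + 1 \cdot p^7$
$x_4$: $+1 \cdot p^3 + 1 \cdot p^4 - 1 \cdot p^5 + 1 \cdot p^7$
$x_5$: $-1 \cdot p^3 + 1 \cdot p^4 - 1 \cdot p^5 + 1 \cdot p^7$
$x_6$: $-1 \cdot p^4 - 1 \cdot p^5 + 1 \cdot p^7$
$x_7$: $+1 \cdot p^6 - 1 \cdot p^7$
$x_8$: $-1 \cdot p^6 - 1 \cdot p^7$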

If we choose $p = 2$ the resulting decimal equivalents could be the same: cf.
contributions based on $+1 \cdot p^1$ and $-1 \cdot p^1 + 1 \cdot p^2$. Given that
the coefficients of the $p^j$ terms ($1 \leq j \leq 7$) are in the set
$\{-1, 0, +1\}$ (implying for $x_1$ the additional terms:
$+0 \cdot p^3 + 0 \cdot p^4 + 0 \cdot p^6$), the coding based on $p = 3$ is required
to avoid ambiguity among decimal equivalents.

Consider the set of objects $\{x_i \mid i \in I\}$ with its p-adic coding considered
above. Take $p = 2$. (Non-uniqueness of corresponding decimal codes is not of
concern to us now, and taking this value for $p$ is without any loss of generality.)
Multiplication of $x_1 = +1 \cdot 2^1 + 1 \cdot 2^2 + 1 \cdot 2^5 + 1 \cdot 2^7$ by
$1/p = 1/2$ gives: $+1 \cdot 2^1 + 1 \cdot 2^4 + 1 \cdot 2^6$. Each level has
decreased by one, and the lowest level has been lost. Subject to the lowest level of
the tree being lost, the form of the tree remains the same. By carrying out the
multiplication-by-$1/p$ operation on all objects, it is seen that the effect is to
rise in the hierarchy by one level.
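
The encoding and the multiplication by $1/p$ can be sketched as follows. This is an illustrative sketch: a code is held as a dictionary from level $j$ to its coefficient, and the function names are assumptions. The code for $x_2$ is taken to differ from that of $x_1$ only in the sign of the $p^1$ term, as in the contributions compared above.

```python
def value(code, p):
    """Decimal equivalent: sum of c_j * p**j over the levels j in the code."""
    return sum(c * p ** j for j, c in code.items())

def dilate(code):
    """Multiply by 1/p: each level drops by one and the lowest level is lost."""
    return {j - 1: c for j, c in code.items() if j > 1}

x1 = {1: +1, 2: +1, 5: +1, 7: +1}
x2 = {1: -1, 2: +1, 5: +1, 7: +1}

print(dilate(x1))                      # {1: 1, 4: 1, 6: 1}
print(dilate(x1) == dilate(x2))        # True: x1 and x2 now share a code

# Ambiguity of decimal equivalents for p = 2, removed for p = 3.
print(value({1: +1}, 2), value({1: -1, 2: +1}, 2))   # 2 2
print(value({1: +1}, 3), value({1: -1, 2: +1}, 3))   # 3 6
```

After one dilation the two objects of the lowest cluster have the same code, which is the merging of two clusters when the bottom level of the tree is lost.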

Let us call product with $1/p$ the operator $A$. The effect of losing the bottom
level of the dendrogram means that either (i) each cluster (possibly singleton)
remains the same; or (ii) two clusters are merged. Therefore the application of
$A$ to all $q$ implies a subset relationship between the set of clusters $\{q\}$ and
the result of applying $A$, $\{Aq\}$.

Repeated application of the operator $A$ gives $Aq, A^2 q, A^3 q, \ldots$. Starting
with any singleton, $i \in I$, this gives a path from the terminal to the root node
in the tree. Each such path ends with the null element, and therefore the
intersection of the paths equals the null element.

Figure 4: Labeled, ranked dendrogram on 8 terminal nodes, $x_1, x_2, \ldots, x_8$.
Branches are labeled +1 and -1. Clusters are: $q_1 = \{x_1, x_2\}$,
$q_2 = \{x_1, x_2, x_3\}$, $q_3 = \{x_4, x_5\}$, $q_4 = \{x_4, x_5, x_6\}$,
$q_5 = \{x_1, x_2, x_3, x_4, x_5, x_6\}$, $q_6 = \{x_7, x_8\}$,
$q_7 = \{x_1, x_2, \ldots, x_7, x_8\}$.

     Sepal.L   Sepal.W   Petal.L   Petal.W
1    5.1       3.5       1.4       0.2
2    4.9       3.0       1.4       0.2
3    4.7       3.2       1.3       0.2
4    4.6       3.1       1.5       0.2
5    5.0       3.6       1.4       0.2
6    5.4       3.9       1.7       0.4
7    4.6       3.4       1.4       0.3
8    5.0       3.4       1.5       0.2

Table 1: First 8 observations of Fisher's iris data. L and W refer to length and
width.

Benedetto and Benedetto [1, 2] discuss $A$ as an expansive automorphism of
$I$, i.e. form-preserving, and locally expansive. Some implications [1] of the
expansive automorphism follow. For any $q$, let us take $q, Aq, A^2 q, \ldots$ as a
sequence of open subgroups of $I$, with $q \subset Aq \subset A^2 q \subset \ldots$,
and $I = \bigcup \{q, Aq, A^2 q, \ldots\}$. This is termed an inductive sequence of
$I$, and $I$ itself is the inductive limit ([32], p. 131).

Each path defined by application of the expansive automorphism defines a 
spherically complete system J2H [101 122] > which is a formalization of well-defined 
subset embeddedness. 

3.2 Haar Wavelet Transform of a Dendrogram 

Determining successive approximations of data, based on the data itself, leads 
us to the Haar wavelet transform of a hierarchy, or on a dendrogram. 

The discrete wavelet transform is a decomposition of data into spatial and
frequency components. In terms of a dendrogram these components are with
respect to, respectively, within and between clusters of successive partitions.
We show how this works taking the data of Table 1.

The hierarchy built on the 8 observations of Table 1 is shown in Figure 5.

Something more is shown in Figure 5, namely the detail signals (denoted
$\pm d$) and overall smooth (denoted $s$), which are determined in carrying out the
wavelet transform, the so-called forward transform.

The inverse transform is then determined from Figure 5 in the following way.
Consider the observation vector $x_2$. Then this vector is reconstructed exactly
by reading the tree from the root: $s_7 + d_7 = x_2$. Similarly a path from root
to terminal is used to reconstruct any other observation. If $x_2$ is a vector of
dimensionality $m$, then so also are $s_7$ and $d_7$, as well as all other detail
signals.

This procedure is the same as the Haar wavelet transform, only applied to
the dendrogram and using the input data.
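
The forward and inverse transforms can be sketched on the data of Table 1. Several things here are assumptions made for illustration: terminals are keyed by their row number in Table 1 (the terminal labels in Figure 5 may be ordered differently), the merge order in `MERGES` was inferred so that the output agrees with the coefficients of Table 2, each smooth is the unweighted mean of its two child smooths, and the detail is stored as the first-listed child minus the smooth.

```python
TABLE1 = {
    1: (5.1, 3.5, 1.4, 0.2), 2: (4.9, 3.0, 1.4, 0.2),
    3: (4.7, 3.2, 1.3, 0.2), 4: (4.6, 3.1, 1.5, 0.2),
    5: (5.0, 3.6, 1.4, 0.2), 6: (5.4, 3.9, 1.7, 0.4),
    7: (4.6, 3.4, 1.4, 0.3), 8: (5.0, 3.4, 1.5, 0.2),
}

# (new node, child on the +d branch, child on the -d branch); assumed order.
MERGES = [("q1", 1, 5), ("q2", 8, "q1"), ("q3", 3, 4), ("q4", 7, "q3"),
          ("q5", 2, "q4"), ("q6", "q2", "q5"), ("q7", 6, "q6")]

def forward(data, merges):
    """Return the root smooth and one detail vector per non-terminal node."""
    smooth, detail = dict(data), {}
    for node, plus, minus in merges:
        s = tuple((a + b) / 2 for a, b in zip(smooth[plus], smooth[minus]))
        detail[node] = tuple(a - m for a, m in zip(smooth[plus], s))
        smooth[node] = s
    return smooth[merges[-1][0]], detail

def inverse(root_smooth, detail, merges):
    """Rebuild every terminal by reading the tree from the root."""
    value = {merges[-1][0]: root_smooth}
    for node, plus, minus in reversed(merges):
        value[plus] = tuple(s + d for s, d in zip(value[node], detail[node]))
        value[minus] = tuple(s - d for s, d in zip(value[node], detail[node]))
    return {k: v for k, v in value.items() if isinstance(k, int)}

s7, details = forward(TABLE1, MERGES)
rebuilt = inverse(s7, details, MERGES)
max_error = max(abs(a - b) for k in TABLE1 for a, b in zip(TABLE1[k], rebuilt[k]))
print([round(v, 6) for v in s7])   # overall smooth
print(max_error < 1e-9)            # True: reconstruction is exact up to rounding
```

Setting selected detail vectors to zero before calling `inverse` gives an approximation of the data rather than an exact reconstruction.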

Figure 5: Dendrogram on 8 terminal nodes constructed from first 8 values of
Fisher iris data. (Median agglomerative method used in this case.) Detail or
wavelet coefficients are denoted by d, and data smooths are denoted by s. The
observation vectors are denoted by x and are associated with the terminal nodes.
Each signal smooth, s, is a vector. The (positive or negative) detail signals, d,
are also vectors. All these vectors are of the same dimensionality.

          s7         d7         d6         d5        d4       d3      d2       d1
Sepal.L   5.146875   0.253125   0.13125    0.1375    -0.025   0.05    -0.025   0.05
Sepal.W   3.603125   0.296875   0.16875    -0.1375   0.125    0.05    -0.075   -0.05
Petal.L   1.562500   0.137500   0.02500    0.0000    0.000    -0.10   0.050    0.00
Petal.W   0.306250   0.093750   -0.01250   -0.0250   0.050    0.00    0.000    0.00

Table 2: The hierarchical Haar wavelet transform resulting from use of the first
8 observations of Fisher's iris data as shown in Table 1. Wavelet coefficient
levels are denoted d1 through d7, and the continuum or smooth component is
denoted s7.

The data required to define this wavelet transform, for the data in Table 1,
is shown in Table 2.

The principle of "folding" the hierarchy onto an external signal is as
follows. The wavelet transform codifies the hierarchy. Having that, we apply the
"codification" of the hierarchy with the new, external signal as input.

Wavelet regression entails setting small and hence unimportant detail
coefficients to 0 before applying the inverse wavelet transform.

More discussion can be found in [27].

3.3 Representation of an Object as a Chain of Successively 
Finer Approximations 

From the wavelet transformed hierarchy we can read off that, say,
$x_1 = d_2 + d_5 + d_7 + s_7$: cf. Figure 5. Or $x_5 = d_6 - d_7 + s_7$. These
relationships use the appropriate vectors shown in Table 2. Such relationships
furnish the definitions used by the inverse wavelet transform, i.e. the recreation
of the input data from the transformed data.

Thus, the Haar dendrogram wavelet transform gives us an additive
decomposition of a given observation (say, $x_1$) in terms of a degrading
approximation, with a variable number of terms in the decomposition. The objects,
or observations, are those things which we are analyzing and on which we have
(i) induced a hierarchical clustering, and (ii) further processed the hierarchical
clustering in such a way that we can derive the Haar decomposition. In this
section we will look at how this allows us to consider each object as a limit
point. Our interest lies in our object set, characterized by a set of data, as a
set of limit or fixed points.

Using notation from domain theory (see, e.g., [5]) we write:

$s_7 \sqsubseteq s_7 + d_7 \sqsubseteq s_7 + d_7 + d_5 \sqsubseteq s_7 + d_7 + d_5 + d_2$    (2)

The relation $a \sqsubseteq b$ is read: $a$ is an approximation to $b$, or $b$ gives
more information than $a$. (Edalat [6] discusses examples.) Just rewriting the very
last, or rightmost, term in relation (2) gives:

$s_7 \sqsubseteq s_7 + d_7 \sqsubseteq s_7 + d_7 + d_5 \sqsubseteq x_1$    (3)

Every one of our observation vectors (here, e.g., $x_1$) can be increasingly
well approximated by a chain of the sort shown in relations (2) or (3), starting
with a least element ($s_7$; more generally, for $n$ observation vectors,
$s_{n-1}$). The observation vector itself (e.g., $x_1$) is a least upper bound
(lub) or supremum (sup), denoted $\sqcup$ in domain theory, of this chain. Since
every observation vector has an associated chain, every chain has a lub. The
elements of the "rolled down" tree, $s_7$, $s_7 + d_7$ and $s_7 - d_7$,
$s_7 + d_7 + d_5$ and $s_7 + d_7 - d_5$, and so on, are clearly representable as a
binary rooted tree, and the elements themselves comprise a partially ordered set
(or poset). A complete partial order or cpo or domain is a poset with least
element, and such that every chain has a lub. Cpos generalize complete lattices:
see [3] for lattices, domains, and their use in fixpoint applications.
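
The approximation order on such chains can be sketched in code. This is an illustration under an explicit assumption: each partial sum is represented by the tuple of terms on its path from the root, so that $a \sqsubseteq b$ holds exactly when the path of $a$ is a prefix of the path of $b$. The function names are hypothetical.

```python
def approximates(a, b):
    """a is an approximation to b: b extends the path of a with more terms."""
    return len(a) <= len(b) and b[:len(a)] == a

def lub(chain):
    """Least upper bound of a chain: the element that every other one approximates."""
    top = max(chain, key=len)
    if not all(approximates(a, top) for a in chain):
        raise ValueError("elements are not totally ordered, so this is not a chain")
    return top

# Relation (2) written as paths of signed terms.
chain = [("s7",),
         ("s7", "+d7"),
         ("s7", "+d7", "+d5"),
         ("s7", "+d7", "+d5", "+d2")]

print(all(approximates(a, b) for a, b in zip(chain, chain[1:])))  # True
print(lub(chain))               # ('s7', '+d7', '+d5', '+d2')
print(approximates(("s7", "-d7"), chain[-1]))                     # False
```

The last call shows two elements of the rolled down tree that lie on different branches and are therefore not comparable: the order is partial, not total.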



3.4 Approximation Chain using a Hierarchy 

An alternative, although closely related, structure with which domains are
endowed is that of spherically complete ultrametric spaces. The motivation comes
from logic programming, where non-monotonicity may well be relevant (this
arises, for example, with the negation operator). Trees can easily represent
positive and negative assertions. The general notion of convergence, now, is
related to spherical completeness ([34, 12]; see also [17], Theorem 4.1). If we
have any set of embedded clusters, or any chain, $q_k$, then the condition that
such a chain be non-empty, $\bigcap_k q_k \neq \emptyset$, means that this
ultrametric space is non-empty. This gives us both a concept of completeness, and
also a fixed point which is associated with the "best approximation" of the chain.

Consider our space of observations, $X = \{x_i \mid i \in I\}$. The hierarchy, $H$,
or binary rooted tree, defines an ultrametric space. For each observation $x_i$, by
considering the chain from root cluster to the observation, we see that $H$ is a
spherically complete ultrametric space.



3.5 Mapping of Spherically Complete Space into Dendrogram
Wavelet Transform Space

Consider analysis of the set of observations, $\{x_i \in X \subset \mathbb{R}^m\}$.
Through use of any hierarchical clustering (subject to being binary, a sufficient
condition for which is that a pairwise agglomerative algorithm was used to
construct the hierarchy), followed by the Haar wavelet transform of the
dendrogram, we have an approximation chain for each $i \in I$. This approximation
chain is defined in terms of embedded sets. Let $n = \mathrm{card}\ X$, the
cardinality of the set $X$. Our Haar dendrogram wavelet transform allows us to
associate the set $\{\nu_j \mid 1 \leq j \leq n - 1\} \subset \mathbb{R}^m$ with the
chains, as seen in section 3.3.

We have two associated vantage points on the generation of observation
$i$, $\forall i$: the set of embedded sets in the approximation chain starting
always with the entire observation set, $I$, and ending with the singleton
observation; or the global smooth in the Haar transform, that we will call
$\nu_{n-1}$, running through all details $\nu_j$ on the path, such that an additive
combination of path members increasingly approximates the vector $x_i$ that
corresponds to observation $i$. Our two associated views are, respectively, a set
of sets; or a set of vectors in $\mathbb{R}^m$. We recall that $m$ is the
dimensionality of the embedding space of our observations. Our two associated
views of the (re)generation of an observation both rest on the hierarchical or
tree structuring of our data.

4 Generalized Ultrametric

4.1 Applications of Generalized Ultrametrics

As noted in the previous subsection, the usual ultrametric is an ultrametric
distance, i.e. for a set $I$, $d : I \times I \to \mathbb{R}^+$ (so the ultrametric
distance is a real value). The generalized ultrametric is:
$d : I \times I \to \Gamma$, where $\Gamma$ is a partially ordered set. In other
words, the generalized ultrametric distance is a set. With this set one can have a
value, so the usual and the generalized ultrametrics can amount to more or less
the same in practice (by ignoring the set and concentrating on its associated
value). After all, in a dendrogram one does have a set associated with each
ultrametric distance value (and this is most conveniently the terminals dominated
by a given node; but we could have other designs, like some representative subset
or other, of these terminals). Remember that the set, $\Gamma$, is defined from the
original attributes (which we denote by the set $J$); whereas the sets of
observations read off a dendrogram are subsets of the observation set (which we
label with the index set $I$). So $\Gamma = 2^J$ (and not $2^I$).

In the theory of reasoning, a monotonic operator is rigorous application of
a succession of conditionals (sometimes called consequence relations). However:
"In order to deal with programs of a more general kind (the so-called disjunctive
programs) it became necessary to consider multi-valued mappings", supporting
non-monotonic reasoning in the way now to be described ([30], pp. 10, 13).
The novelty in the work of [30l E] is that these authors use the generalized 
ultrametric as a multivalued mapping. 

(A more critical view of the usefulness of the generalized ultrametric
perspective is presented by [15].)

The generalized ultrametric approach has been motivated [35] as follows.
"Situations arise ... in computational logic in the presence of negations which
force non-monotonicity of the operators involved". To address non-monotonicity of
operators, one approach has been to employ metrics in studying some problematic
logic programs. These ideas were taken further in examining quasi-metrics,
and generalized ultrametrics, i.e. ultrametrics which take values in an arbitrary
partially ordered set (not just in the non-negative reals). Seda and Hitzler [35]
"consider a natural way of endowing Scott domains [see [1]] with generalized
ultrametrics. This step provides a technical tool [for finding fixpoints - hence
for analysis] of non-monotonic operators arising out of logic programs and
deductive databases and hence to finding models for these."

A further, similar, viewpoint is [12]: "Once one introduces negation, which is
certainly implied by the term enhanced syntax ... then certain of the important
operators are not monotonic (and therefore not continuous), and in consequence
the Knaster-Tarski theorem [i.e. for fixed points; again see [1]] is no longer
applicable to them. Various ways have been proposed to overcome this problem.
One such [approach is to use] syntactic conditions on programs ... Another
is to consider different operators ... The third main solution is to introduce
techniques from topology and analysis to augment arguments based on order ...
[latter include:] methods based on metrics ... on quasi-metrics ... and finally ...
on ultrametric spaces."

Table 3: Example dataset: 5 objects, 3 boolean attributes.

The convergence to fixed points that are based on a generalized ultrametric
system is precisely the study of spherically complete systems and expansive
automorphisms discussed in section 3.1. As expansive automorphisms we see
here again an example of symmetry at work.

4.2 Link with Formal Concept Analysis 

In this subsection, we consider an ultrametric defined on the powerset or join 
semilattice. Comprehensive background on ordered sets and lattices can be 
found in [3]. 

As noted in section 2, typically hierarchical clustering is based on a distance
(which can be relaxed often to a dissimilarity, not respecting the triangular
inequality, and mutatis mutandis to a similarity), defined on all pairs of the
object set: $d : I \times I \to \mathbb{R}^+$. I.e., a distance is a positive real
value. Usually we require that a distance cannot be 0-valued unless the objects are
identical. That is the traditional approach.

A different form of ultrametrization is achieved from a dissimilarity defined
on the power set of attributes characterizing the observations (objects,
individuals, etc.) $X$. Here we have: $d : X \times X \to 2^J$, where $J$ indexes
the attribute (variables, characteristics, properties, etc.) set.

We consider a different notion of distance, that maps pairs of objects onto
elements of a join semilattice. The latter can represent all subsets of the
attribute set, $J$. That is to say it can represent the power set, commonly denoted
$2^J$, of $J$.

As an example, consider, say, $n = 5$ objects characterized by 3 boolean
(presence/absence) attributes, shown in Table 3.

Define dissimilarity between a pair of objects in Table 3 as a set of 3
components, corresponding to the 3 attributes, such that if both components are
0, we have 1; if either component is 1 and the other 0, we have 1; and if both
components are 1 we get 0. This is the simple matching coefficient [14]. We
could use, e.g., Euclidean distance for each of the values sought; but we prefer
to treat 0 values in both components as signaling a 1 contribution. We get then:

d(a,b) = 1,1,0
d(a,c) = 0,1,0
d(a,e) = 0,1,1
d(a,f) = 1,1,0
d(b,c) = 1,1,0
d(b,e) = 1,1,1
d(b,f) = 1,1,0
d(c,e) = 0,1,1
d(c,f) = 1,1,0
d(e,f) = 1,1,1

If we take the three components in this distance as d1, d2, d3, and considering
a lattice representation with linkages between all ordered subsets where the
subsets are to be found in our results above (e.g., d(c,f) = 1,1,0 implies that
we have a lattice node associated with the subset d1,d2), and finally such that
the order is defined on subset cardinality, then we see that the representation
shown in Figure 6 suffices.

Figure 6: Lattice and its interpretation, corresponding to the data shown in
Table 3 with the simple matching coefficient used. (See text for details.) The
figure distinguishes potential lattice vertices from lattice vertices found, and
carries the following annotations.
The set d1,d2,d3 corresponds to: d(b,e) and d(e,f).
The subset d1,d2 corresponds to: d(a,b), d(a,f), d(b,c), d(b,f), and d(c,f).
The subset d2,d3 corresponds to: d(a,e) and d(c,e).
The subset d2 corresponds to: d(a,c).
Clusters defined by all pairwise linkage at level $\leq 2$: a, b, c, f.
Clusters defined by all pairwise linkage at level $\leq 3$: a, b, c, e, f.
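
The set-valued dissimilarity and the lattice vertices it produces can be sketched as follows. The attribute values in `DATA` are hypothetical: they were chosen to be consistent with the ten dissimilarities listed above, which do not fully determine the second attribute. Function names are also assumptions.

```python
from itertools import combinations

# Hypothetical attribute values, consistent with the listed dissimilarities.
DATA = {"a": (1, 0, 1), "b": (0, 1, 1), "c": (1, 0, 1),
        "e": (1, 0, 0), "f": (0, 0, 1)}

def dissimilarity(x, y):
    """Component is 0 only when both objects possess the attribute, else 1."""
    return tuple(0 if (u == 1 and v == 1) else 1 for u, v in zip(x, y))

def as_subset(d):
    """Name the components that are set, e.g. (1, 1, 0) gives {d1, d2}."""
    return frozenset(f"d{k + 1}" for k, bit in enumerate(d) if bit)

pairs = {(x, y): dissimilarity(DATA[x], DATA[y])
         for x, y in combinations(sorted(DATA), 2)}

# Group object pairs by the lattice vertex (subset of attributes) they map to.
vertices = {}
for pair, d in pairs.items():
    vertices.setdefault(as_subset(d), []).append(pair)

for subset in sorted(vertices, key=len, reverse=True):
    print(sorted(subset), vertices[subset])
```

Only four of the eight subsets of the attribute set occur as dissimilarity values, which is why a partial lattice is enough to represent this dataset.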

In Formal Concept Analysis [3, 11, 18], it is the lattice itself which is of
primary interest. In [14] there is discussion of, and a range of examples on, the
close relationship between the traditional hierarchical cluster analysis based on
$d : I \times I \to \mathbb{R}^+$, and hierarchical cluster analysis "based on
abstract posets" (a poset is a partially ordered set), based on
$d : I \times I \to 2^J$. The latter, leading to clustering based on
dissimilarities, was developed initially in [15].

5 Conclusion 

Data analysis allows us to go from measured data to a computational path or a
set of approximations used to represent the objects of analysis. We have noted
that examples of application to face recognition and to documents can be seen
in [28].

Computational logic in an analogous way uses metric and ultrametric
embeddings. Within such topologies, computation is carried out. We have focused
in this article on ultrametric embedding, i.e. given as a hierarchy or tree.

It is interesting, and without question exciting, to envisage further
cross-linkage between data analysis and computational logic.